chore(dataobj): data object encoding and decoding #15676
Conversation
This commit introduces the encoding package with utilities for writing and reading a data object. This initial commit includes a single section called "streams". The streams section holds a list of streams for which logs are available in the data object file; it does not hold the logs themselves, only the stream labels along with an ID.

Encoding
--------

Encoding presents a hierarchical API to match the file structure (see the sketch after this description):

1. Callers open an encoder.
2. Callers open a streams section from the encoder.
3. Callers open a column from the streams section.
4. Callers append a page into the column.

Child elements of the hierarchy have a Commit method to flush their written data and metadata to their parent.

Each element of the hierarchy exposes its current MetadataSize. Callers should use MetadataSize to control the size of an element. For example, if Encoder.MetadataSize goes past a limit, callers should stop appending new sections to the file and flush the file to disk.

To support discarding data after reaching a size limit, each child element of the hierarchy also has a Discard method.

Decoding
--------

Decoding separates each section into a different Decoder interface to more cleanly separate the APIs. The initial Decoder is for ReadSeekers, but later implementations will include object storage and caching.

The Decoder interfaces are designed for batch reading, so that callers can retrieve multiple columns or pages at once. Implementations can then use this to reduce the number of roundtrips (such as retrieving multiple cache keys in a single cache request).

encoding.StreamsDataset converts an instance of a StreamDecoder into a dataset.Dataset, allowing callers to use the existing dataset utility functions without downloading an entire dataset.
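As a rough illustration of the encoding flow above, here is a sketch of how a caller might drive the hierarchy. The identifiers (NewEncoder, OpenStreams, OpenColumn, AppendPage, maxMetadataSize, flushToDisk) are paraphrased from this description and are not the package's exact signatures:

```go
// Sketch only: names mirror the description, not the exact package API.
enc := encoding.NewEncoder(&buf)

streams, err := enc.OpenStreams() // open a streams section from the encoder
if err != nil {
	return err
}
col, err := streams.OpenColumn(columnInfo) // open a column from the section
if err != nil {
	return err
}
if err := col.AppendPage(page); err != nil { // append a page into the column
	return err
}

// Commit flushes a child's data and metadata up to its parent;
// Discard drops the child instead, e.g. after overshooting a size limit.
if err := col.Commit(); err != nil {
	return err
}
if err := streams.Commit(); err != nil {
	return err
}

// MetadataSize bounds the growth of an element: once the encoder's
// metadata goes past a limit, stop appending sections and flush.
if enc.MetadataSize() > maxMetadataSize {
	flushToDisk(&buf)
}
```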
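And a corresponding sketch of the batch-oriented decoding side. ReadSeekerDecoder, StreamsDecoder, ReadPages, and the descriptor values are illustrative stand-ins based on the description, not confirmed names:

```go
// Sketch only: illustrative names for the batch-oriented Decoder API.
dec := encoding.ReadSeekerDecoder(rs) // rs is an io.ReadSeeker

// Batch methods accept several descriptors at once so implementations
// (object storage, caches) can coalesce roundtrips, e.g. fetching
// multiple cache keys in a single cache request.
pages, err := dec.StreamsDecoder().ReadPages(ctx, columnDescs)
if err != nil {
	return err
}
_ = pages

// StreamsDataset wraps the decoder as a dataset.Dataset so the existing
// dataset utilities work without downloading the entire data object.
ds := encoding.StreamsDataset(dec.StreamsDecoder())
_ = ds
```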
Force-pushed from eb7c9cc to 2c48dc1
```go
var protoBufferPool = sync.Pool{
	New: func() any {
		return new(proto.Buffer)
	},
}
```
@cyriltovena I initially set proto.Buffer.SetDeterministic here to have deterministic encoding of protobufs, but I think there's a bug in gogo protobuf that prevents it from working.

Either way, I think our encoding is already deterministic as long as we never include map types in our protobufs. I'll have some tests for that once we have the final pieces that tie everything together.
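For context, a minimal sketch of what enabling deterministic marshaling on the pooled buffers would look like with gogo's proto.Buffer, assuming the upstream issue were fixed:

```go
var protoBufferPool = sync.Pool{
	New: func() any {
		buf := new(proto.Buffer)
		// Would request deterministic output (e.g. sorted map keys),
		// but a suspected gogo/protobuf bug keeps this disabled for now.
		buf.SetDeterministic(true)
		return buf
	},
}
```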
LGTM
This PR introduces metadata for a "logs section," which is intended to hold a sequence of log records across one or more streams. The code is a near-identical copy of grafana#15676. Future work is needed to determine whether the encoding, decoding, and dataset implementations of dataset sections can be deduplicated.